367 research outputs found

    Early register release for out-of-order processors with register windows

    Get PDF
    Register windows is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper we propose two early register release techniques that leverages register windows to drastically reduce the register requirements, and hence reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the same number as logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version

    Leveraging register windows to reduce physical registers to the bare minimum

    Get PDF
    Register window is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverage register windows to drastically reduce the register requirements, and hence, reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version

    Time-predictable parallel programming models

    Get PDF
    Embedded Computing (EC) systems are increas-ingly concerned with providing higher performance in real-time while HPC applications require huge amounts of information to be processed within a bounded amount of time. Addressing this convergence and mixed set of requirements needs suitable programming methodologies to exploit the massively parallel computation capabilities of the available platforms in a pre-dictable way. OpenMP has evolved to deal with the programma-bility of heterogeneous many-cores, with mature support for fine-grained task parallelism. Unfortunately, while these features are very relevant for EC heterogeneous systems, often modeled as periodic task graphs, both the OpenMP programming interface and the execution model are completely agnostic to any timing requirement that the target applications may have. The goal of our work is to enable the use of the OpenMP parallel programming model in real-time embedded systems, such that many-cores architectures can be adopted in critical real-time embedded systems. To do so, it is required to guarantee the timing behavior of OpenMP applications

    Techniques for reducing and bounding OpenMP dynamic memory

    Get PDF
    OpenMP offers a tasking model very convenient to develop critical real-time parallel applications by virtue of its time predictability. However, current implementations make an intensive use of dynamic memory to efficiently manage the parallel execution. This jeopardizes the qualification process and limits the use of OpenMP in architectures with limited amount of memory. This work introduces an OpenMP framework that statically allocates the data structures needed to efficiently manage parallel execution in OpenMP programs. We achieve the same performance than current implementations, while bounding and reducing the dynamic memory requirements at runtime

    Cine-forum: “La felicidad en el séptimo arte"

    Get PDF
    Comunicación presentada en el Curso "La felicidad humana", dentro de los Cursos de verano UBU 201

    Modeling high-performance wormhole NoCs for critical real-time embedded systems

    Get PDF
    Manycore chips are a promising computing platform to cope with the increasing performance needs of critical real-time embedded systems (CRTES). However, manycores adoption by CRTES industry requires understanding task's timing behavior when their requests use manycore's network-on-chip (NoC) to access hardware shared resources. This paper analyzes the contention in wormhole-based NoC (wNoC) designs - widely implemented in the high-performance domain - for which we introduce a new metric: worst-contention delay (WCD) that captures wNoC impact on worst-case execution time (WCET) in a tighter manner than the existing metric, worst-case traversal time (WCTT). Moreover, we provide an analytical model of the WCD that requests can suffer in a wNoC and we validate it against wNoC designs resembling those in the Tilera-Gx36 and the Intel-SCC 48-core processors. Building on top of our WCD analytical model, we analyze the impact on WCD that different design parameters such as the number of virtual channels, and we make a set of recommendations on what wNoC setups to use in the context of CRTES.Peer ReviewedPostprint (author's final draft

    A static scheduling approach to enable safety-critical OpenMP applications

    Get PDF
    Parallel computation is fundamental to satisfy the performance requirements of advanced safety-critical systems. OpenMP is a good candidate to exploit the performance opportunities of parallel platforms. However, safety-critical systems are often based on static allocation strategies, whereas current OpenMP implementations are based on dynamic schedulers. This paper proposes two OpenMP-compliant static allocation approaches: an optimal but costly approach based on an ILP formulation, and a sub-optimal but tractable approach that computes a worst-case makespan bound close to the optimal one.This work is funded by the EU projects P-SOCRATES (FP7-ICT-2013-10) and HERCULES (H2020/ICT/2015/688860), and the Spanish Ministry of Science and Innovation under contract TIN2015-65316-P.Peer ReviewedPostprint (author's final draft

    Big Data Analytics for Smart Cities: The H2020 CLASS Project

    Get PDF
    Applying big-data technologies to field applications has resulted in several new needs. First, processing data across a compute continuum spanning from cloud to edge to devices, with varying capacity, architecture etc. Second, some computations need to be made predictable (real-time response), thus supporting both data-in-motion processing and larger-scale data-at-rest processing. Last, employing an event-driven programming model that supports mixing different APIs and models, such as Map/Reduce, CEP, sequential code, etc.The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under the CLASS Project (www.class-project.eu), grant agreement No. 780622.Peer ReviewedPostprint (author's final draft

    Improving performance guarantees in wormhole mesh NoC designs

    Get PDF
    Wormhole-based mesh Networks-on-Chip (wNoC) are deployed in high-performance many-core processors due to their physical scalability and low-cost. Delivering tight and time composable Worst-Case Execution Time (WCET) estimates for applications as needed in safety-critical real-time embedded systems is challenged by wNoCs due to their distributed nature. We propose a bandwidth control mechanism for wNoCs that enables the computation of tight time-composable WCET estimates with low average performance degradation and high scalability. Our evaluation with the EEMBC automotive suite and an industrial real-time parallel avionics application confirms so.The research leading to these results is funded by the European Union Seventh Framework Programme under grant agreement no. 287519 (parMERASA) and by the Ministry of Science and Technology of Spain under contract TIN2012-34557. Milos Panic is funded by the Spanish Ministry of Education under the FPU grant FPU12/05966. Carles Hernández is jointly funded by the Spanish Ministry of Economy and Competitiveness and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella is partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Peer ReviewedPostprint (author's final draft

    Taskgraph: A Low Contention OpenMP Tasking Framework

    Full text link
    OpenMP is the de-facto standard for shared memory systems in High-Performance Computing (HPC). It includes a task-based model that offers a high-level of abstraction to effectively exploit highly dynamic structured and unstructured parallelism in an easy and flexible way. Unfortunately, the run-time overheads introduced to manage tasks are (very) high in most common OpenMP frameworks (e.g., GCC, LLVM), which defeats the potential benefits of the tasking model, and makes it suitable for coarse-grained tasks only. This paper presents taskgraph, a framework that uses a task dependency graph (TDG) to represent a region of code implemented with OpenMP tasks in order to reduce the run-time overheads associated with the management of tasks, i.e., contention and parallel orchestration, including task creation and synchronization. The TDG avoids the overheads related to the resolution of task dependencies and greatly reduces those deriving from the accesses to shared resources. Moreover, the taskgraph framework introduces in OpenMP the record-and-replay execution model that accelerates the taskgraph region from its second execution. Overall, the multiple optimizations presented in this paper allow exploiting fine-grained OpenMP tasks to cope with the trend in current applications pointing to leverage massive on-node parallelism, fine-grained and dynamic scheduling paradigms. The framework is implemented on LLVM 15.0. Results show that the taskgraph implementation outperforms the vanilla OpenMP system in terms of performance and scalability, for all structured and unstructured parallelism, and considering coarse and fine grained tasks. Furthermore, the proposed framework considerably reduces the performance gap between the task and the thread models of OpenMP
    • …
    corecore